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NONPARAMETRIC REGRESSION PENALIZING DEVIATIONS 

FROM ADDITIVITY 1 

By M. Studer, B. Seifert and T. Gasser 

University of Zurich 

Due to the curse of dimensionality, estimation in a multidimen- 
sional nonparametric regression model is in general not feasible. Hence, 
additional restrictions are introduced, and the additive model takes 
a prominent place. The restrictions imposed can lead to serious bias. 
Here, a new estimator is proposed which allows penalizing the non- 
additive part of a regression function. This offers a smooth choice be- 
tween the full and the additive model. As a byproduct, this penalty 
leads to a regularization in sparse regions. If the additive model does 
not hold, a small penalty introduces an additional bias compared to 
the full model which is compensated by the reduced bias due to using 
smaller band widths. 

For increasing penalties, this estimator converges to the additive 
smooth backfitting estimator of Mammen, Linton and Nielsen [Ann. 
Statist. 27 (1999) 1443-1490]. 

The structure of the estimator is investigated and two algorithms 
are provided. A proposal for selection of tuning parameters is made 
and the respective properties are studied. Finally, a finite sample 
evaluation is performed for simulated and ozone data. 

1. Introduction. Let (X, ;,e«), i = l,...,n, be independent identically 
distributed random vectors with X , G [0, l] d . Define the response as = 
r (Xj) +e%. The errors have expectation zero and variance a 2 and are 
independent of Xj- The goal is to estimate r truc (x) given data (Xj,Yi). 

In the full model, we assume only that the unknown regression function 

r truc (x)=E(V|X = x) 

is smooth. Specifically, we assume that r true is twice continuously differen- 
tiate as we will use a local linear estimator. The rate of convergence of 
mean square error is 
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Estimating in the full model suffers from the "curse of dimensionality." 
This leads to consideration of less general models. In the additive model it 
is assumed that the regression function has the special form 

(1) r truc (x) = rX + rX(*i) + ' " ' + rSKjfa). 

The rate of convergence is 0(n -4 / 5 ) as for d=l [Stone (1985, 1986)]. 

Choosing the additive model may lead to serious bias due to neglecting the 
nonadditive component of the regression function. Estimating the full model 
may, however, lead also to a large bias since a large (optimal) bandwidth 
has to be used to achieve the same rate for variance as for squared bias. 

In this paper we introduce a parametric family of estimators f_ ^ (R > 0) 
which includes asymptotically optimal estimators for the full (R = 0) and 
the additive (R = oo) model as special cases. The aim is to offer a continuous 
model choice via the tuning parameter R. 

The philosophy behind additive modeling might be described as follows: 
rather than assuming the strict validity of the additive assumption, one goes 
for the additive part of the underlying regression function to avoid the curse 
of dimensionality. The approach of this paper offers us more flexibility in 
case of highly nonadditive functions: instead of switching to the full model 
(or tolerating a large bias for the additive fit), one chooses a fit in between, 
which takes into account part of the nonadditive structure. 

Local linear estimation. For fixed x, let (3 = (/3o, . . . ,/3d) be the mini- 
mizer of 

i n ( d y \ ^ 

(2) SSR(P,x) = -J2 y»-/9b-E^ i ' k ~ Xk ) ^(Xi.x), 

where K h (~Xi, x) = K(diag(/ii, . . . , /id) _1 (2£i - x))/(Ai /i d ) > is the 
kernel weight of the observation (Xj,^) for the output point x ■ The band- 
widths h\, . . . , hd are scale parameters. We assume that hi, . . . ,hd are of the 
same order and set h = \fh~i ha- The diagonal matrix with diagonal el- 
ements hi, . . . , hd is denoted by diag(/ii, . . . , hd). The local linear estimator 
of r truc (x) at output point x is /So- 
under usual regularity conditions, variance is proportional to (nh d )~ l 
and squared bias is proportional to h 4 . The optimal rate for the MSE is 
n -4/(4+d)^ us j n g a bandwidth h proportional to n~ l ^ 4+d \ The local linear 
estimator achieves asymptotically the linear minimax risk when using spher- 
ically symmetric Epanechnikov kernels. This optimality result was shown 
in Fan (1993) for d = 1 and in Fan, Gasser, Gijbels, Brockmann and Engel 
(1997) for d > 1. For finite sample size, however, regularization is an issue 
[Seifert and Gasser (1996)], as the variance is unbounded in sparse regions. 
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As we will see later, our modeling approach via the parameter R leads, as a 
byproduct, also to a regularization. 

Mammen, Linton and Nielsen (1999) (referred to as MLN below) intro- 
duced a backfitting estimator for the additive model which achieves the 
same asymptotics as the oracle estimator, which is a univariate local lin- 
ear estimator for data (Xj^, Yi — J2 K ^=k r add«(-^*)«)) - Consequently, it in- 
herits the above mentioned optimality. They evaluated a local linear esti- 
mator on a continuum (e.g., [0,1]^), using a vector of parameter functions 
ZL(x) — ( r °(x)> • • • ) rd (x))- The first function r°(x) is the intercept (i.e., [3q 
for x) and the other functions are slopes (i.e., Pi, ■ ■ ■ ,(3d)- MLN decompose 
r(x) into additive (r add ) and orthogonal (zij_) components, and set the 
orthogonal component to zero: 



The estimator £ add has an interpretation as a projection V* r n of the local 
linear r_u to the additive subspace. 

Instead of a projection we use a penalty R to shrink the orthogonal com- 
ponent towards zero. Formally, 



For R = we get the usual local linear estimator, and for R = oo we obtain 
the additive estimator of MLN. For general R we get a family of estimators 
connecting r H with £ add with common additive part V* f_ R = £ add . 

Example. Let us now illustrate the benefit of a smooth choice between 
full and additive models for some simulated data with known regression 
function and random uniform design; see Figure 1. 

Originally this realization of the data was used in Seifert and Gasser 
(2000) in the context of locally ridging the local linear estimator. (Another 
50 realizations are summarized in Section 5.1.) Due to symmetry of the true 
regression function [r true (xi, x?) = r e (x2,X\)], there is no need to consider 
separate bandwidths for each coordinate. Note that the smoothing windows 
have the same size in the interior and at the boundary by choosing a larger 
bandwidth at the boundary [see Figure 1(a)]. We use a product Epanech- 
nikov kernel. The output grid consists of 50 x 50 points and the parameters 
are the minimizers of integrated squared residuals (ISE); see Figure 3(a). 

Even though the regression function is clearly nonadditive, penalizing 
the nonadditive part leads to a remarkable improvement in optimal ISE 
from 8.3 [Figure 2(a)] to 6.0 [Figure 2(d)]. A small penalty R stabilizes 
output points where the local linear estimator is wiggly but has little effect 
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in well-determined regions [Figure 2(b) vs. 2(d)]. This illustrates another 
useful property of penalizing: regularization of the local linear estimator. 

Generalizing a method often improves goodness of fit, while parameter 
selection becomes more difficult. Let us apply AICc for selecting parameters 
R and h [Hurvich, Simonoff and Tsai (1998)]. This criterion tries to find a 




Fig. 1. Simulated data using n = 200 random observations (a) (design n = 200) and re- 
gression function (b) (true regression function) (range [9,54], residual variance a 2 = 25). 
Smoothing windows are of constant size due to increased bandwidth at the boundary: 
(a) displays smoothing windows for h = 0.117 and h = 0.174 at output points (0.55,0.55) 
and (0,0). 



(a) 




(c) (d) 




Fig. 2. Comparison of different estimators. The local linear estimator is either heavily 
biased (a) (h = 0.174, R = 0, ISE = 8.3j or wiggly (b) (h = 0.117, R = 0, ISE = 8.8,). Ad- 
ditive estimation (c) (h = 0.197, R = oo, ISE = 17) is even worse. The penalized estimator 
(d) (h = 0.117, R = 0.163, ISE = 6.0 ) is stabilized without over smoothing: ISE is improved 
by more than a quarter. 
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0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

R/(1+R) R/(1+R) 



Fig. 3. Comparison of ISE (a) [integrated squared error (ISE)] and AICc (b) as a 
function of bandwidth h (log-scale) and penalty R (j^-scale). The global minimum of ISE 
is at (h = 0.117, R — 0.163). A contour line bounds a region of parameters outperforming 
the ordinary local linear estimator (minimum at h = 0.174/ 

compromise between good fit and small complexity of the model (i.e., low 
trace of hat matrix). Figure 3 shows that parameter selection is successful 
in this example. 

Contents. The paper is organized as follows: Section 2 defines the pro- 
posed estimator both in a discrete and in a continuous version. A computa- 
tionally efficient direct and an iterative algorithm are developed in Section 3. 
Properties of the estimator are studied in Section 4: the penalized estimator 
is shown to be a pointwise compromise between an additive and the local lin- 
ear fit. A decomposition into an additive part and an orthogonal remainder 
term is derived, where only the nonadditive part involves shrinking. In addi- 
tion to model flexibility, the approach offers a regularization in sparse regions 
of the design. We then justify the interpretation of the local linear and the 
additive estimators as special cases of £ r for R^0 or oo. The convergence 
of f_ R to the MLN estimator for R — > oo is investigated. The data- adaptive 
simultaneous choice of h and R is analyzed to some extent. Section 5 is 
devoted to a simulation study. Furthermore, the estimator is applied to the 
ozone dataset. A summary of the contents is provided at the beginning of 
each section. Software is available on our homepage www.biostat.unizh.ch. 

2. Definition of the penalized estimator. A local linear estimator is eval- 
uated on a set of output points. For penalizing deviations from the additive 
model, these output points should form a product space. In Section 2.1 we 
choose the interval [0, l] d as a continuous set of output points and start with 
definitions similar to MLN. This choice is suitable for deriving theoretical 
properties. In practice the continuous set of output points is approximated 
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by an equidistant grid, 



12 1 f 1 

0, -, -,...,1 x ••• x 0, 



mi — 1 mi — 1 J L — 1 ^-d — 1 

as in Section 2.2. 

2.1. Estimation on an interval. We will introduce a Hilbert space (J-, \\ ■ ||, 
such that the local linear estimator r if corresponds to a projection of the 
response Y to some subspace C T . 

MLN consider a subspace T^&a C .Ffuii of additive functions and obtain 

i-add 

by projecting Y to T^. 

We consider another norm || • ||r being the sum of || • ||* and some penalty 
with parameter R on the squared distance from J-'add- The penalized esti- 
mator r_ R is the projection with respect to || • \\r of Y to J-i u \\. 

Define the vector space of (n + l)(d + 1) functions 

T = { r = (r^li = 0,...,n;£ = 0,..., d)\r^ : [0, l] d K}. 

Let us define the projection Vq on which replaces r*' by r 0,i . In other 
words, if r = VoL, then r i ' (x) = r°^(x). The image of "Po is denoted 
by .Ffuii- For simplicity of notation, the index i is omitted: 

^fuU = {L = (r°, . . . ,r d )\r e : [0, l] d -> R, t = 0, . . . ,d}. 

The observations Y, i = 1, . . . , n, are coded as r Y £ J- by 

\ / Yj, for i > and £ = 0, 
y \ 0, otherwise. 

Define the design-dependent seminorm || • ||* on T by 



HI 2 



1 n 

-E 

i=i 



i,0/ \ , i.fc/ \Xi,k x k 

' ' x) + > r ' (x; 

A-l 



2 



where iiu,(X,-, x) is the kernel weight of the observation (X,-, Y) for the 
output point x- 

Hence, for r G .Ff u ii we have 

1 n r 

(3) kv-dl2=/^E 



n . 



Yi-r°U)-J2r k W: Xi ' k Xk 



hi 

k=l K 



and the integrand corresponds to the minimization problem for the local 
linear estimator. Consequently, we denote the minimizer by 

The interpretation of f_u as projection of r Y to J-f u \\ was developed by 
Mammen, Marron, Turlach and Wand (2001) and is quite useful when in- 
corporating constraints, that is, minimizing (3) for r in a subset of J-f u i\. 
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Consider now an additive subspace J^dd C .Ffuii: 

•^add = {l£ •? r fuiik (x) is additive; 

for k = 1, . . . ,d, r fc (x) depends only on x^}- 

Define the additive estimator £ a dd as the minimizer for r € J- a dd of \\r Y — 
L\\l 

Projecting a Nadaraya- Watson estimator to an additive subspace was first 
considered by Nielsen and Linton (1998) for d = 2 and extended to higher di- 
mensions in MLN. The projected local linear estimator £ add was introduced 
by MLN and has attractive properties. Nielsen and Sperlich (2005) discuss 
practical aspects of this estimator, which is called smooth backfitting there. 
These include implementation, parameter selection by cross validation and 
finite sample evaluation. 

Let us introduce further notation. Define the L -norm on T by 



-i n d . 



Denote by "Padd the || • ^-orthogonal projection from ^f u u into ^"add- 
More formally, we define r add = V ad dL via r° dd (x) = J2k=l I r °( * ) dx_ k - 
(d—1) J r°(x)dx and ^a dd (x) = f r k (x) dx_ k , where / • • • dx-fc denotes the 
integral with respect to all components of x except x^. Furthermore, let V* 
be the || • ||* projection from T (or T^a) to .Fadd! see Appendix A. 0.10. 

Next, a penalty on the nonadditive part of r is added to || • ||*. Define the 
seminorm || • \\ R on J-: 

\\r\\ 2 R = \\r\\ 2 ,+R\\(3-V add )V r\\l 

where J is the identity. The penalized estimator f_ R is defined as the mini- 
mizer of 

2 



|2 / x 

-Y ~ L\\R ~ 

i=l 



Kh(~Ki, x)dx 



(4) 

+ RW-V add )r\\ 2 2 

for r £ -^fuii- For the penalty term we use the fact that Vq is the identity 
on .Ff u ii and that VqZy = 0- The latter was the reason for introducing the 
components with i = in T . Properties of r_ R will be analyzed in Section 4. 

Remark on the choice of the penalty in (4). For any choice of 
the penalty, the MLN estimator £ add would be the additive part of r_ R with 
respect to the norm || • H*, assuming invariance under addition of an element 
•^add to r in (4); see Proposition 4 in Section 4.2. 

We used || • H2 for the penalty instead of || • ||* because the latter is inferior 
in sparse regions. Moreover, V* is not self-adjoint with respect to || • [U. 
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2.2. Estimation on a grid. Now an approximation of (4) on a grid is 
derived. 

Let the output grid 

{*!,... x {tf, . .. , t„ 2 } X • • • X {tf, . . . ,tm d } C [0, l] d 

consist of rrik values for the &ith coordinate and enumerate its m = mi x 
• • • x md output points by t j = (tj,i, • ■ • , tj t d) for j = 1, . . . , m. In order to get 
an appropriate approximation of (4), the output grid has to be sufficiently 
dense and has to increase with n. Denote by (3^^ G M. d+l the parameters 
of the local linear estimator at tj. The parameter space for the local linear 
estimator on the output grid is 

Ffcu = {§_ = coljd&fiypW G R d+1 } = R rn ( d+1 \ 

where coL(/3^) denotes the column vector obtained by vertically stacking 
/3«,...,/3 (m) . The accompanying norm is the Euclidean norm || • ||. The 
additive subspace is defined as 

F add = {coL;(^))|3 r add G ^ add : ffi = ^(t,)}. 

Let P a dd be the orthogonal projection from Ff u n to F a dd- The local linear 
estimator fl{p at output point tj is the minimizer of the sum of weighted 
squared residuals SSR [see (2) in Section 1]. The simultaneous local linear 
estimator on the grid minimizes the sum of SSR over all output points tj. 
Finally, we add a penalty proportional to the squared distance of the pa- 
rameters to the additive submodel, 

m 

(5) l R = argmin^ SSR{^\ tj) + - P ad d)£|| 2 . 

The penalized estimator ?r is the intercept of /3^, that is, rn(tj) = \J3 ^]q. 
An efficient algorithm will be presented in Section 3.2. 

3. Dimension reduction and algorithms. In this section we derive algo- 
rithms for calculating the local linear estimator with nonadditivity penalty 
on a grid. In Section 3.2 we derive a formula for computing (3 R which avoids 
storing and inverting large matrices. An iterative algorithm using these con- 
cepts is provided in Section 3.3. Modifications for large R are also discussed. 

3.1. Notation and normal equations. Define for k, k = 1, . . . ,d: 
1 n 

Sb,o(x) = -2]K h (Xi, x), 
n 
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<Sb,fc(x) = Sfc,o(x) = - y2K h (Xi, x)— ^- : 

n r-t hk 

i=i * 

Sk, K U) = -Z^ R h(2ii, X) 7 7 , 

lb /l^. /t K 



and for k = 1, . . . ,d: 

1 n 

A)(x)=-y;^fc(x i ,x)ri, 



n . 
i=i 

^(x)=-E^(^,x) 'r Xfc y '- 

Denote by S(x) the (d + 1) x (d + 1) matrix with elements S^z(x) and 
L(x) = col^ = o,...,d(^(x)). 

Let S( J ) = S(tj) and iS^ = L(t j). The normal equations (3^ = argmin^(j) SSR{(3^\ tj) 
for the local linear estimator at tj are 

s(;)3 ( ^ = l«. 

Similarly, for simultaneous local linear estimation on the whole grid we have 
§.11 = co Wzf )' S = dia gj(S (i) ), L = col^L^ ) and 

The normal equations for the penalized estimator (5) are 

(6) (S + i?(I-P add ))3 fl =L. 

3.2. Dimension reduction. Simultaneous estimation on a grid requires a 
large number of parameters. Dimension reduction is necessary for computa- 
tion. 

The normal equations (6) for f3 R are ((S + i?I) — RP ^d) (3 R = L. Because 
S + RI is a block-diagonal matrix and i?P a( jd has low rank, solving the 
normal equations may be simplified using matrix algebra [Rao and Kleffe 
(1988), page 5, and Appendix A. 0.5 here]. We decompose P a dd into a product 
Z T Z. Using the abbreviation = R(S + RI) -1 , we obtain 

(7) (3 R = (I + A R Z T {I - ZA fi Z T r Z)(S + El)' 1 L , 

where {•}" denotes any generalized inverse. 

The matrix Z has rank 2m* + 1 — d, where m* = m\ + • • • + m^. In Ap- 
pendix A. 0.3 an explicit choice for Z with dimension 2m* x m(d+l) is given. 
The multiplication Z/3 consists mainly of 2m* sums of totally 2dm terms. 
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Similarly, calculation of ZA R Z T from A R leads to (2d) 2 m summations. 
Calculation of A^ from S is of order d 3 m operations. 

Formula (7) leads to a feasible algorithm because the dimension of the 
matrices to be inverted is relatively small compared with (6). 

An oblique projection. Let us define an oblique projection in order to 
simplify formula (7): 

P SiR = Z T {(I - ZZ T ) + Z(I - A fi )Z T }-Z(I - A R ). 

In Appendix A. 0.5 we show that Ps,j? is the orthogonal projection from Ff u n 
to 

Fadd with respect to the inner product ((3,(1 — A^)/3). In particular, 
(I - A R )P S , R is symmetric and Pg^I - A R )(I - Ps,r) = 0. 

Because 1 — A R = (S + i2I) _1 S and S/3 = L, we substitute (S + i?I) _1 L 
in (7) by (I — A#)/3„ and obtain 

(8) (3 R = A fl Ps,fl3„ + (I - A H )3 ir 
See Proposition 1 in Section 4.1 for interpretation. 

Modification for large R. For large R,l — A R is of order R~ l and Ps,i? is 
hence numerically unstable. Because Ril — A R ) = (I + i? _1 S) _1 S is suitable 
for large R, we modify Ps,i? by multiplying both terms I — A R by R. 

Note that A R = (I + i?~ 1 S)~ 1 . Formula (8) for large R then becomes 

(3 R = (R~ 1 I + A R Z T {(1 - ZZ T ) + Z(A i? S)Z T }^Z)A /? L . 

3.3. Iterative calculation of the penalized estimator. We provide in addi- 
tion an iterative algorithm for the penalized estimator. This avoids inversion 
of the matrix I — ZA R Z T and even its calculation. 

We use the fact that Padd/3^ — ^*s,rP u holds (Proposition 2, Section 4.1) 
to calculate Ps,_r/3,, iteratively via (8), 

l [a R +1] = AflPaddgg + (I - A R ){3 U . 

Only the additive part 7 M of (3^ R is iterated: 

(9) 7> +1 ] = ZA R Z T 7 M + Z(I - A fl )| H , 

where 7^ = Finally, set 

3 i? = A if Z T 7M + (I-A iJ )3 H . 

Uniqueness of (5) implies that I — ZA R Z T is positive definite [proof in Studer 
(2002), Appendix B.l]. Accordingly, we have exponential convergence due to 
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fixed point iteration and contraction; see Table 1. In case of nonuniqueness 
of (3 , the algorithm still converges (see Appendix A. 0.6). 

The squared difference between the intercepts of f3 R and f3 R in Table 1 
diminishes quickly and is negligible for a > 3 compared with the ISE. The 
starting value was /3 R = 0. 

Modification for large R. Algorithm (9) converges because ZArZ t is a 
contraction. The eigenvalues of Ar are, however, increasing with R and A^ 
has the identity I as limit for R — > oo . Therefore convergence is slower for 
large R and does not work for R = oo. 

For large R, we choose a > such that aS < I and use 

(10) ZP R = Z(I - aR(I - A R ))Z T Z§_ R + aZR(I - A R )0 U 

for iterations instead of (9). 

Generalizations. The derivations for (7) assume only that Z T Z is a pro- 
jection. Hence, if F a( jd is replaced by another subspace F su b, say, and Z is 
modified such that Z T Z is the orthogonal projection from Ff u u to F su b, then 
the above algorithms remain valid. Generalization from local linear to local 
polynomial estimation is achieved by corresponding modification of S and 
L. 

Implementation is simplified by the fact that Z need not have full rank. 
For the iterative algorithm (9), moreover, there is no need to calculate the 
matrix Z explicitly. For example, if F su b corresponds to using bivariate in- 
teraction terms in the additive model or postulating the same regression 
function for subgroups, multiplication by Z, Z T and ZA R Z~^ can be imple- 
mented efficiently 

4. Properties of the estimator. In this section, we evaluate the effect of 
the nonadditivity penalty on the estimator. Both on a grid (Section 4.1) 
and on an interval (Section 4.2), the penalized estimator turns out to be 
a pointwise compromise between the local linear and some (i?-dependent) 
additive estimator. The compromise depends on how well the local linear 
estimator is determined locally. This is an attractive property as it leads 
to automatic regularization in sparse regions (provided that the additive 



Table 1 

Convergence of iterations for the estimator m in Figure 2(d) 



a 





1 


2 


3 


4 


5 


ISE 


\\[$R — J intercept | 2 


463 


15.5 


0.8 


0.1 


0.02 


0.008 


6.0 
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estimator is well determined). The additive part of f_ R is studied using two 
different norms (Propositions 3 and 4). 

Later on, we focus on the smoothness of the model choice via the penalty 
parameter R for fixed n. Continuity in R for R £ (0,oo) is obvious and the 
cases R = and oo are investigated in Section 4.4. We investigate the rate 
of convergence of £ R to Laddi depending on whether or not r truc is additive. 
In both cases we find a rate for R such that || r R — z a dd 111 ^ s °f smaller order 
than n -4 / 5 . In Section 4.5 we consider the data-adaptive choice of R and h. 
In Section 4.6 we see that in the case of fixed uniform design (d < 4) f_ R 
with data-adaptive R is equivalent to £ add for additive functions. 

4.1. Properties of the estimator on a grid. We investigate the effect of 
the penalty on estimation at one output point: the penalized estimator is a 
kind of convex combination between a local linear and an additive estimator. 
Furthermore, the local linear estimator may be decomposed into a sum of 
additive and residual components. The penalized estimator is the sum of the 
additive part and shrunken residuals, which are orthogonal to the additive 
part. 

In Section 3.2 an oblique projection Ps,_r was introduced, leading to 

(3 R = A fl P s ,fl3 H + (I - A R )P U 

in (8). Recall that A R = diagj(i2(S^^ + -RI)" 1 ) is block- diagonal with eigen- 
values between zero and 1. Let us see what (8) implies for one output point. 
Denote by (Ps,fl/3^)^ the components of Ps,rP a corresponding to output 
point tj, formally Ps,R§_ u = coL,((P s ,,r3 h ) (j,) ). 

Proposition 1. The penalized estimator (3^ R is a pointwise compro- 
mise between some (R-dependent) additive fit (Ps,R0n) ana " the local lin- 
ear fit 

If = (sU) + M)- 1 {i?(p s ,i ? 3 H )(^ + s {j) l\p}- 

In sparse regions the local linear estimator is unstable [Seifert and Gasser 
(1996)], because S^) may be nearly singular. The above formula indicates 
that penalizing solves this problem as a byproduct, because the additive 
part of (3 R is stable under weaker conditions. This regularization property 
is illustrated in Figure 2(b) versus 2(d). When all eigenvalues of S^^ are 
large, the effect of a small penalty R vanishes. 

We derive now a decomposition of (3 R into an additive component and 
an orthogonal remainder term. Formula (8) is equivalent to 

(11) l R = Vs,r(3 u + (I - A R )(I - Ps,fl)3„. 
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Only the nonadditive part involves shrinking. 

Proposition 2. The following relations hold: 
Padd3 fl = Ps,Jl3,, and (I-Padd)3 fl = (I-Afl)(I-Ps,ji)3 ir 

The proof is in Appendix A. 0.7. 

4.2. Properties of the estimator on an interval. Now we will show that 
Propositions 1 and 2 hold not only on a grid but also on an interval. Propo- 
sition 4 states that the additive part of f_ R with respect to V* is r add , 
independent of R. 

Define the symmetric, continuous operator 5* : J-{ u ii —> J'fuii, L l— > L by 

r°(x)\ /5o,o(x)r°(x) + ••• + ^(xjr^x) \ 

r d U)J Ud,o(x)r°(x) + ••• + S d4 U)r d U)J 

(see Section 3.1). We have by construction that ||zi||* = (r><5*r)2- Let r_L G 
•^luii with r£(x) =L^(x). The normal equations for £ w , the minimizer of 
(3) in Section 2.1, are 

S*Lu = Ll- 

The normal equations for r R , the minimizer of (4), are 

(12) (S* + R(3-r add ))r R =r L = S*? ll . 

Let "P^i? denote the orthogonal projection from .Ffuii to F^dd with respect 
to the norm || • \\ R . According to MLN, V* r is continuous with probability 
tending to 1 for n— > oo, under regularity conditions for the design density 
and kernel [Conditions MLN:B1 and MLN:B2' in Appendix A.0.2]. In par- 
ticular, the bandwidth h is of order n -1 / 5 or larger [Condition C1+]. An 
explicit formula for V* t R is given in (26), Appendix A. 0.8. 

Then f_ R may be decomposed similarly to (8) and Proposition 1: 

(13) r R = {(S* + BS)- X BP^ + («S* + R3)~ 1 S^}r_u- 

Note that (5* + i?J) _1 5* is a pointwise (in x) matrix multiplication. Fur- 
thermore, (5* + i2J) _1 5* and (5* + R3)~ 1 R sum to 3 and have eigenvalues 
between zero and 1. Hence, (13) indicates that r R (x) is some kind of convex 
combination of r_ u and V* t R?ii- 

Similarly to Proposition 2, the above formula may be rewritten as 



(14) 



l r = {v*, R + (s* + rj^s^j - p*, r )}l u . 
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Proposition 3. The following relations hold: 

V a ddLR = 'P* t RLu and (J — 'Padd) Lr = (<S* + RJ)~ 1 S* (3 — V*^) r n . 

In Section 4.4 we will see that V^aLr — £ a dd is 0(R~ 1 ), for fixed n. The 
i?-dependence of the additive part V^Lr can be avoided when using the 
oblique projection V* instead: 

Proposition 4. V*lr = £ a dd holds. 

This means that the MLN estimator is the additive part of f_ R with 
respect to the norm || • ||*. Both proofs can be found in Appendix A. 0.8. 

4.3. Bounding and V* r. If So t o(x) is a uniformly consistent density 
estimator, the operators S* and V* t R are shown to be bounded. This property 
will be used in Section 4.4. 

Let 2, sup denote the supremum norm of based on the Euclidean 
norm on .Ff u ii. Here, we want to find upper bounds for ||«S*||2,sup> that is, 
a uniform bound for the maximum eigenvalue of S(x). Because the ker- 
nel is bounded with compact support by Condition MLN:B1, the maximal 
eigenvalue of S(x) is of order So,o(x) - Note that So,o(x) is a kernel density 
estimator of /. We are thus interested in uniform boundedness from above 

of /(x) = Sb,o(x)- 

Silverman (1978) derived uniform consistency of kernel density estimators 
for d=l. We will use a result of Gao (2003), which asserts uniform consis- 
tency for density estimators for continuous densities on W 1 and bandwidths 
h satisfying 

nh d 

(15) h^O and - — - — 77 — > 00 as n — > 00. 

log(/i~ 1 ) 

By Condition C1+ these conditions are satisfied for d < 4. For d > 5, the 
condition (15) is not satisfied for the optimal bandwidth oc n" 1 / 5 of the 
additive model. We thus lose flexibility in the model choice when (15) is 
assumed. 

Proposition 5. Under Conditions MLN:B1, MLN:B2' and (15), ||S*|| 2 ,su P 
is uniformly bounded with probability tending to 1 for n — > 00. 

Note that for fixed n, this norm is always bounded because of Condi- 
tion MLN:B1. The i?-dependent projection V^^r may be bounded uniformly 
in R using Proposition 5 and Lemma 2 in Section 4.4: 

Lemma 1. Under Conditions MLN:B1, MLN:B2' and Cl+n (15), ||P*,k||2,sup = 
Op(l), uniformly in R. 

The proofs are in Appendix A. 0.9. 
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4.4. The additive and full models as special cases. Here we justify the 
interpretation of r_u and £ add as special cases of r_ R for R = and R = oo, 
respectively. This is appreciated because r_u an d £ a dd are known to be 
asymptotically optimal for the respective situations. The rate of convergence 
of r_R to £ add f° r R * 00 depends on the supremum norm of 5* and whether 
or not the regression function is additive. 

We will start with the convergence of f_ R to ? u for R j 0. Consider the 
case where S(x) _1 is uniformly continuous in x- This is a sufficient con- 
dition for bounded variance of r u and represents therefore the well-behaved 
cases. In this case, S* has a continuous inverse and the limit of f_ R for R { is 
r_n- Let us mention that uniform continuity is a stronger assumption than 
uniqueness of r w . Uniform continuity means that for any r £ ^fuii with 
|| r || 2 = 1 the norm ||r ||* is bounded away from zero, whereas uniqueness 
needs only a nonzero norm. If ?n is not well determined, a small positive 
penalty provides the desired regularization. 

Mammen, Linton and Nielsen (1999) showed that the additive estimator 

fadd = argmin||r y - r\\l 

r G^add 

is asymptotically oracle optimal under Conditions MLN:B1-B4' and CI for 
additive regression functions as in (1). Let us therefore examine the conver- 
gence of f_ R to £ add for R^ oo. Decompose via 

(16) \\Lr ~ £addll2 = W ~ ^add)£iil|| + WPaddtR ~ faddlll- 

For bounds of the first term of the sum, see Lemma 4 below. 

Recall that V^ddLR = 'P*,RLu holds by Proposition 3. Similar to the mod- 
ifications for large R in Section 3.2, we introduce an alternative formula for 
V*,rLii [defined in (26), Appendix A.0.8], 

(17) V ad dL R = (P add (J + J R" 1 ^)" 1 cS,P add )^ add P add (J + R- 1 ^)- 1 r L , 

where (• • -)\^ dd indicates that the expression is an operator on ^" add . Define 
Sadd = ('PaddS^addV^- Solving the normal equations for £ add leads to 

Ladd = ^add^add Ll- 

The right-hand side is equal to the right-hand side of V^Lr for R~ l = 0. If 
-R -1 1| 5* || 2, sup tends to zero, we may use a Taylor approximation and obtain 
WPaddLR ~ faddlb = CM-R^H^Iksup); see Lemmas 2 and 3. 

Lemma 2. Under Conditions MLN:B1, MLN:B2' and C1+, 5 add con- 
verges for n — > oo to an operator with continuous inverse. 

Hence, S~^ d is continuous with probability tending to 1 for n — > oo and 
Il^addll2,sup is uniformly bounded in n, Vn > li, with probability tending to 1 
for n — > oo . 
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The proof is given in Appendix A. 0.10. Uniqueness of £ add 1S equivalent to 
llladdll* > for a11 Hadd £ ^add-{Q}- Lemma 2 states that || r add ||*/|| r add || 2 
is bounded away from zero with probability tending to 1. 

Lemma 3. Under Conditions MLN:B1, MLN:B2', MLN:B3' and C1+ 
we obtain for fixed n (i.e., conditional on the data) 

sup |P add (J - (J + ir 1 ^ 1 ) r L (x)| = 0(RT l ) 

X 

and UPaddHLlb is finite. 

For increasing n, we obtain 

sup |P add (:J - (J + iT 1 ^)- 1 ) r L (x)| = P {R- 1 ||5*|| 2 ,sup). 

x 

Furthermore, 

\\V, dd r L \\ 2 = P (l). 

A proof is given in Appendix A. 0.10. 

Lemma 4. The following bound holds: 

W -V add )r R \\l < 2ir 1 ||& ft || 2 ,sup||z:y - faddlMlfii- faddlb- 
Under Condition MLN:B3' we have 

2 , 1 v 2 _ ( finite, for fixed n ( and Y ), 



\LY ~ I add II* -~^2 Y i 



\Op(l), for increasing n. 



See Appendix A. 0.10 for a proof. Using (16) we obtain: 

Theorem 1. Assume Conditions MLN:B1, MLN:B2', MLN:B3' and 
C1+. For fixed n, 

\\L R -L &dd h = 0{R- 1 ) 

holds with probability tending to 1 for increasing n. Formally, this means 
Ppimsup J i_» 00 12||r fl - £ add || 2 < oo] n= ?° 1. 
For n — > oo 

\\LR - Laddh = Op(-R _1 ||'5*||2,sup)- 

Note that this holds also for nonadditive regression functions. For additive 
regression functions we obtain a better bound; see Theorem 2. 

Applying Proposition 5 to Theorem 1, for h oc n" 1 / 5 (Condition CI), d < 4 
and R~ l = o(n -2 / 5 ), we get \\f_ R — r add ||2 = op(re -4 / 5 ). For h oc n -1 / 5 and 
d > 5, IISJI 

2, sup is not bounded by a constant and R needs to converge 
faster to oo to achieve equivalence. Alternatively, one might use a larger 
bandwidth. (Without proof.) 
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Theorem 2. Assume Conditions MLN:B1-B4', d < 4, h oc [nT 1 / 5 , n _1/(4+(i) ] 
and an additive regression function (1). TTien 



ffl- faddll2 = Op 



1 



RVnfv 1 



For h oc n x / 5 and additive regression functions, R 1 = o(n ^ 1 )/ 10 ) is 
sufficient to obtain equivalence of r ^ and r add . The proof is in Appendix A. 0.10. 

4.5. Data-adaptive parameter selection. We consider the simultaneous 
choice of the tuning parameters R and h. In the case of an additive regres- 
sion function r truc , the first-order bias of is independent of R and pa- 
rameter selection is asymptotically equivalent to the classical variance/bias 
compromise. Hence, h oc n -1 / 5 and R — > oo. The rate of R is investigated in 
Section 4.6. 

Asymptotically, we have only to consider the cases r true = additive or full 
model, and the question is then whether R is able to identify these cases. 

We consider parameter selection criteria that depend on fitted values at 
design points. The vector of fitted values Y = coh = i i ... >n (r^(Xi)) = MrY 
depends linearly on Y, where M/j is called "hat matrix." 

In practice, the estimator is computed on a grid and we need some inter- 
polation to obtain estimates at the design points. Define 

I fl (Xi) = (S(Xi) + Riy^SiXi^iXi) + fl(P Bl Ji£ H )(X i )}, 

where (Ps,fl/3,,)(Xj) denotes the interpolated value at Xj of the additive 

part Ps,R.(3 u - Therefore, Y = colj([/3^( Xj)]o) is a linear combination of Y 
and construction of Mp is obvious. 
We consider the following criteria: 

AIC(i?, h) = log(5 2 ) + 2tr(M fl )/n, 

GCV ^^- (l-tr(M.)/nr 
AIC c (*,*)=]og(*V * 



l-(tr(M fl ) + 2)/n' 

where a 2 = ^|| Y — Mr Y || 2 and tr(M#) denotes the trace of Mb, which 
is interpreted as degrees of freedom. AIC and GCV are classical model se- 
lection criteria [see, e.g., Hastie and Tibshirani (1990)], and AICc was in- 
troduced by Hurvich, Simonoff and Tsai (1998). These criteria are justified 
for r n only when (15) is satisfied. As we want to analyze the ability of it! to 
identify the additive model with its optimal bandwidth (Condition CI), we 
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will assume d < 4 throughout this section. Moreover, we will assume that 
E[e 4 ] < oo. 

Let us compare the criteria (r = — tr(Mij)) 
AIC = log(a 2 ) + 2r, 

log(GCV) = log(a 2 ) + 2 ( t + ^ + ^ + • 



AICc - 1 = log(a 2 ) + ^ + 2^ 



2 3 

(n — l)n k T k 

fc>] 



^ (n-2) fc+1 



«log(a 2 ) + 2(r + r^ + r d + •••)• 

All are of the form log(a 2 ) plus some penalty against undersmoothing; 
see Hardle, Hall and Marron (1988). As the penalty increases from AIC to 
log(GCV) and further to AICc, minimizing these criteria leads to increas- 
ingly more smoothing (decreasing T m j n ): According to Hurvich, Simonoff and 
Tsai, AICc avoids the large variability and the tendency to undersmooth of 
GCV and classical AIC observed when estimating bandwidths for d = 1 . Note 
that AIC has its global minimum at interpolation (h = 0, R = 0), leading to 
a 2 = and t = 1. Undersmoothing, however, contradicts the aim of this pa- 
per and is avoided by assuming h oc [n" 1 / 5 , n 

-V(4+*0] a nd R > R mhl (h). The 
lower bound R m i n (h) is chosen to bound the variance of r_ R by the optimum 
rate n~ i l <yi+d \ This condition does not rule out the asymptotically optimal 
additive estimator (R = oo,h oc n" 1 / 5 ). Then -tr(M^) — > as n — > oo and 

g-lisOp(^). 

Hence, we may use the approximation log(<r 2 ) = log(<7 2 ) + ^ — 1 + Op(^). 
Define the Taylor approximation of AIC — log(cr 2 ), 

a 2 2 
AIC T = ^-l + -tr(M i? ). 

The expected value of r H is for additive regression functions 



(18) 



A- 



The leading terms are in J- a dd an d hence unchanged under multiplication 
by Vx^r, and the op(h 2 ) terms remain small enough because of Lemma 1. 
Accordingly, (I — Mij)r^^ = 0(h 2 1 ), and the first-order terms of the bias 
of Y are independent of R. 
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Next, we need to ensure that f_ R is not degenerate. Using (14) in Sec- 
tion 4.2, we choose some small constant -R m i n > 0, assume that R > R m i n , 
and consequently (5* + R3)~ <S*£^ is stable (ridge regression). With prob- 
ability tending to 1, S~ dd is continuous (Lemma 2), that is, r add is stable. If 
both 5* and S^ d are bounded, we do not have to worry about stability of 
'P*,RLn (see also the proof of Lemma 1). Therefore, ||M]jMp|| SU p = Op(l). 
Obviously, when S~ 1 is continuous, we need not assume that R > R m \ n . 

Therefore, ^((1 - M R )s, (I - Mr) r^> = Op(^), which is o P (h 4 ) for 
h as in Condition C1+. Note that a 2 tr(M R ) = E[(e, M R e )]. Hence 

AIC T - f^||£|| 2 -l 



-^E[||Mp £ || 2 ] + -^||(I - M*)r*3f 



+ ^(l-E)[||Mp£|| 2 + 2(e,M R e)] + Opf^=), 

where (1 - E)[(e,Me>] = <e,Me) - E[(e,Me)]. 

Lemma 5. ForE[e 4 } < oo, var((e,Mpe.)) = 0(E[||Mp£|| 2 ]). Moreover, 
if HMjMplUp = P (1), then var(||Mp£ || 2 ) = P (E[\\M R e || 2 ]). 

Because E(||Mp e || 2 ) is of order h~ d and h^ 1 for i? = and R = oo, 
respectively, the standard deviation of (1 — E)[- • •] is of smaller order. The 
proof is given in Appendix A. 0.11. 

The leading terms of AICt — (— j|| £ || 2 — 1) are a variance/bias compro- 
mise 

- E[||Mps|| 2 ] + -i,||(I-Mp)r^ d e || 2 , 



na 2 na 2 



which is minimized for R = oo and h oc n x / 5 . Consequently for AICt: 

Proposition 6. Under the assumptions of Theorem 2 and E[e 4 ] < oo, 
h achieves the rate n -1 / 5 and R— > oo (with probability tending to 1). 

If the true regression function is nonadditive, any R -/* induces a bias of 
order 0(1), that is, an AICt — (^rll £ II 2 ~~ 1) °f 0(1). On the other hand, in 
well-behaved cases (continuous 5" 1 ), the optimal AICt — (• • ■) of the local 
linear estimator is of order 0(n _4// ( 4+rf )), leading to R — > and h oc n -1 /( 4+rf ) 
in these cases. 
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4.6. Investigating the rate of R for AICt- Data-adaptive parameter se- 
lection is studied for fixed uniform design and additive regression functions; 
in this case, the penalty R is large enough such that £ R — £ add becomes 
negligible. 

As seen before, we can restrict ourselves to the case h oc n -1 / 5 and R — > oo. 
In order to simplify the structure of r^fX,), we assume a fixed uniform 
design: S(Xj) is diagonal and constant in the interior. Furthermore, we 
ignore boundary effects and /^-dependency of V* tR . This allows us to simplify 
(13) as 

(19) r°(X.) = °(X,) + ^ d (X ( ). 

By (19) we have M R = \M U + (1 - A)M add with A = Hence, AIC T is 
a polynomial of degree 2 in A, 



AIC T = A 2 (-^||M H e || 2 + -L||M add £|| 2 - — ^M^M^e; 
L na z na z na z 

+ \\-^E[(M ll £,M add e)) - ^2||M add e || 2 

+ — (1 -E)[(e,Mag) + (M z/ £,M add £) - (e ,M add e)]l 
ncr z J 

+ ||(I - M R )r^f + P (h*/V^) + (^Ikll 2 - l) 

+ -L||M add £|| 2 + -^(1 - E)[(e l M add e)]}. 

In Appendix A. 0.12 we show that the dominating terms of A 2 and A 
are ^E[||M H £|| 2 ] oc ^ and (1 - E)(e,M„e> = P {-^) } respectively, 

leading to A min « - ( 1 - E) ( e , M« e > /E [2 1 1 M„ e 1 1 2 ] = O p ( h d ' 2 ) . 

Proposition 7. Under the assumptions of Theorem 2, fixed uniform 
design and E[e 4 ] < oo, we obtain R~ l = Op{n~ d l 10 ) for AICt- 

(See Appendix A. 0.12 for a proof.) By Theorem 2, R grows fast enough 
to ensure || ? R - £ add || 2 = Op(n" 4 / 5 ). 

Note that the rate of the minimum is not affected by the M add -dependent 
terms. Accordingly, assuming V* R = V* is not critical. 

Comparison of AIC C and AIC T . By (19), tr(M B ) = Atr(M K ) + (1 - 
A) x tr(M add ) is monotone decreasing in R, because (asymptotically) 1 > 
£tr(M,,) > ^tr(M add ). If we add ^ tv(M R )) 2+i {I > 0) to AIC T , A min be- 

comes smaller. Hence, AICc [with log(er 2 ) replaced by log(cr 2 ) + ^ — 1] 
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chooses R at least as large as AICt- However, this effect is asymptotically 
negligible as the leading terms are unchanged. 

5. Finite sample evaluation. For the example in Section 1, we com- 
pared the penalized estimator with the local linear {R = 0) and the additive 
(R = oo) estimator and obtained a lower integrated squared error (ISE) for 
the penalized estimator. As seen in Figure 3, data-adaptive choice (speci- 
fied in Section 4.5) of the parameters R and h is successful: the theoretical 
improvement due to generalization holds also in practice. In Section 5.1, we 
will see whether this holds for other situations. Furthermore, we will inves- 
tigate how the estimator performs in the special case of an additive model; 
see Section 5.2. Finally (Section 5.3), we apply our estimator to the ozone 
dataset already analyzed by Hastie and Tibshirani (1990). 

5.1. Nonadditive regression function. We will examine 50 realizations 
of the same kind as in Section 1. Later on, we summarize the effect of a 
nonuniform design density and a larger sample size. 

In the example in Section 1, the optimal penalty parameter is larger than 
zero and the penalized estimator outperforms the local linear estimator when 
using optimal parameters. The optimal parameters are approximated suffi- 
ciently well by AICo 

Here we generate 50 realizations of the data as follows: the design con- 
sists of 200 random observations X,, uniform in [0,1] 2 . The response Yi is 
r tmc (Xj) + £j, where r true is shown in Figure 1(b) [1 = (1, 1) T ]: 

(20) r truc (x) = iSe" 32 "^ 1 / 4 )^ 2 + 35 e - 128 ^-( 3 / 4 ^l 2 + 25 e - 2 U^ 1 /2)I U 2 

and Si is normally distributed (a = 5). 

In order to find the optimal parameters for each realization of the design, 
we calculated the ISE for different pairs (R, h) [see Figure 3(a)] and per- 
formed a grid search. Actually, (j^,log 10 (^)) is equidistant with resolution 
(0.01,0.005). Similarly, we find the minimizers of AICc and GCV. 

The simulation is summarized by 50 realizations of ISE evaluated for 
the penalized, the local linear and the additive estimator using optimal and 
data-adaptive parameters. 

Let us introduce some notation. The global minimum at (-^opt> ^opt) is 
ISE(opt). The minimum of the local linear estimator (R = instead of 
for numerical reasons) is ISE(opt, R = 0) and the minimum of the additive es- 
timator (R = 9999 instead of oo) is ISE(opt, R = oo). Data-adaptive param- 
eters (-Raic>^aic) are obtained by finding the minimizer of AICc (Section 
4.5) on a grid. The corresponding ISE is denoted by ISE(AIC). Analogously, 
we write ISE(AIC, R = 0) for the local linear and ISE(AIC, R = oo) for the 
additive estimator. 
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Table 2 and Figure 4 summarize the results of this evaluation. Given the 
optimal parameter values for R and h, penalized estimation has clearly the 
potential for improvement compared to fitting the full model with a median 
percentage gain of 17% [item (a) of Table 2 and Figure 4]. This relative 
gain is larger for realizations with a small ISE. The additive estimator is not 
competitive and will hence not be shown. 

To be able to achieve these gains in practice, we need a good method for 
parameter selection. The corrected Akaike criterion AICc is such a method, 
and is moreover computationally attractive. When comparing (c) of Table 2 
with (a), we see that the performance based on estimated R and h is almost 
as good as that based on optimal parameters. Item (b) shows that data- 
adaptive parameter selection via AICc is attractive: a median increase in 
relative ISE of only 10% has to be tolerated. 

Interestingly, application of the full model with optimal bandwidths is 
clearly inferior to using the penalized estimator with data-driven parameter 
selection [see item (d)]. 

Other situations. The above simulation was also carried out for two 
nonuniform designs on [0,1] 2 , 

fi(xi,x 2 ) = \ + \{x\ + x 2 ) and f 2 (xi,x 2 ) = § - \(x\ + x 2 ). 

Both are linear in {x\ + x 2 ) and have range (0.5, 1.5). Density /i is preferred 
over f 2 because of the high peak in r truo at (0.75,0.75). This is reflected 
by the optimal penalty i? pt: compared with the uniform design, f\ needs a 
larger and f 2 a smaller penalty; see Table 3(e). Similarly, the ideal relative 
gain due to penalizing is larger for f\ and smaller for f 2 , item (a). Again, the 
performance remains almost as good when selecting parameters R and h via 
AICc [see item (c)]. The penalized estimator with AlCc-selected parameters 
is better than the local linear estimator with optimal parameters; however, 
for density f 2 the difference becomes smaller. 



Table 2 

Quantiles for ideal relative gain due to penalizing (a), for loss due to AICc 

selection (b), for relative gain (loss) due to penalizing for data-adaptive 
parameters (c), for relative difference between data-adaptive penalized and 
optimal full estimator (d) and for R op t (e) 

Model defined in (20) 

/ \ ISE(opt,fl=Q)-ISE(opt) 
W ISE(opt) 
/. \ ISE(AIC)-ISE(opt) 

y°) isE(opt) 

/ \ ISE(AIC,.R=0)-ISE(AIC) 
y C ) ISE(opt) 

/ ,\ ISE(opt,fl=0)-ISE(AIC) 
W ISE(opt) 

(e) -Ropt 



min 


10% 


med 


90% 


max 


1.8% 


4.9% 


17% 


49% 


65% 


0.2% 


1.1% 


10% 


24% 


66% 


-8.0% 


1.3% 


16% 


47% 


149% 


-50% 


-9.7% 


11% 


27% 


51% 


0.03 


0.06 


0.12 


0.25 


0.32 





Fig. 4. Comparison of ISE performance for the nonadditive regression function (20). 
Relative gain due to penalizing depending on ISE(opt) (a) [y = ISE(opt, R = 0)/ISE(opt) 
vs. x = ISE(opt)], effect of AICc selection (b) [y = ISE(AIC) us. x = ISE(opt)], com- 
parison of penalized vs. full modeling, parameters data-driven (c) [y — ISE(AIC, R = 0) 
vs. x — ISE(AIC)], comparison of penalized modeling (parameters data- adaptive) with full 
modeling (optimal bandwidth) (d) [y = ISE(opt, R = 0) vs. x = ISE(AIC)] . 



When doubling n to 400, parameter selection is improved; see Table 3, col- 
umn "400," item (b). Because of the smaller R Q pt, the gain due to penalizing 
is smaller but still not negligible. 

Parameter selection by GCV instead of AICc shows the same pattern 
(data not shown). 

5.2. Additive regression function. For additive regression functions, the 
question arises whether we pay a price for the additional flexibility of penal- 
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Table 3 

For different design densities we compare the medians of the 
same quantities as in Table 2 



Model defined in (20) 


fl 


unif 


h 


400 


(a) 
(b) 


ISE(opt,fl=0)-ISE(opt) 
ISE(opt) 
ISE(AIC)-ISE(opt) 
ISE(opt) 
ISE(AIC,fl=0)-ISE(AIC) 

ISE(opt) 
ISE(opt,i?=0)-ISE(AIC) 
ISE(opt) 

^?opt 


19% 
10% 


17% 
10% 


6% 
4.3% 


12% 
4.5% 


(c) 
(d) 
(e) 


20% 
11% 
0.15 


16% 
11% 
0.12 


11% 
1.5% 
0.10 


13% 
6.7% 
0.09 



The number of realizations is 50, the sample size is n = 200 and 
the name of the column denotes the design density — except the 
last column (n = 400, uniform). 



izing local linear estimation compared with additive estimation. Therefore, 
we choose an additive regression function and examine 50 realizations. 
We generated data using the regression function 

2 

r truc (x) = V (^ e -32(a!fc-l/4) 2 + ^ e -128(x fc -3/4) 2 + & e -2(x k -l/2) 2 y 
k=l 

Uniform design Xj and errors £j (a = 5) are the same as in Section 5.1. 
Estimating the additive model can be considered easy, as the data are rich 
enough for multivariate local linear estimation. 

Since AICc has no problems with undersmoothing, we ignore in the simu- 
lations the impracticable condition in Section 4.5 — excluding undersmoothing — 
which was imposed for classical AIC. For AICc the additive model is de- 
tected in 47 out of 50 cases, as i?Aic attains the maximal value. In the 
remaining three realizations we obtained a relative loss in ISE of 0.6% and 
5.2% in two cases; a gain of 3.5% was achieved in one case. Hence, model 
choice by AICc was successful in this example. 

Model selection by GCV detected the additive model in 24/50 cases only, 
whereas classical AIC failed completely (0/50). 

5.3. Application to ozone data. We apply our method to the ozone dataset 
using three out of nine predictors. The penalized estimator detects relevant 
deviations from an additive model. The local linear estimator produces ar- 
tifacts, which do not occur in the penalized estimator. 

We used the ozone dataset from the R package gss; see Hastie and Tibshirani 
[(1990), Section 10.3]. The variable "wind speed" (wdsp) contains one exces- 
sive value (observation number 92) which was removed, leading to n = 329. 
The dependent variable Y was chosen as the logarithm of the "upland ozone 
concentration" (upo3). Using gam (package mgcv), we chose those three 
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predictors which maximize adjusted R-squared among additive models with 
bivariate interaction terms with 16 degrees of freedom each: "humidity" 
(hmdt), "inversion base height" (ibtp), and "calendar day" (day). 

Note that this additive model with bivariate interaction terms has roughly 
the same adjusted R-squared as the additive model with all nine predictors 
and four degrees of freedom for each component; see Table 4. Hence, when 
using these three predictors, we expect substantial information in the inter- 
action terms. 

The three variables in this model were scaled to [0,1]. Let univariate 
bandwidths hi, /12 and /13 correspond to four degrees of freedom each, as in 
Hastie and Tibshirani (1990). These bandwidths lie close together (min = 
95% max) with mean h = \lh\hihj, at 0.237. Parameters R and c are selected 
by AICc, such that the bandwidths are ch\, c/12 and 0/13, respectively. 

For the penalized estimator, AICc selected R = 0.04 and ch = 0.2065 and 
is clearly nonadditive. For the local linear estimator, AICc selected ch = 
0.240. The lower half of Table 4 demonstrates that the penalized estimator 
is vastly better than the additive one in terms of adjusted ii-squared, and 
slightly better than the local linear estimator. 

Next, we orthogonally decompose the local linear and the penalized es- 
timator into intercept, additive components, bivariate interactions and re- 
mainder. Penalizing shrinks the bivariate interactions and the remainder; 
see Table 5. 

Figure 5 compares the local linear and penalized estimators, univariate 
components on the top and the largest bivariate interaction (ibtp, hmdt) 
on the bottom. The plots for univariate components demonstrate those reg- 
ularization properties of the penalized estimator. The plot in the center 
of the bottom row shows the design and the smoothing windows for one 



Table 4 

Adjusted R-squared for different models and estimators 



Estimator 




# independent variables 


.R-squared 
(adj.) 


Regression 


additive 


9 


82.5% 


spline (gam) 


additive 


3 


73.9% 


d/ = 4|16 


additive + bivariate interaction 


3 


81.3% 


Penalized 


£ H multivariate (i?=10~ 4 ) 


3 


81.7% 


local linear 


£ fl penalized (R = 0.04) 


3 


82.5% 


AICc 


ladd additive (7? = 00) 


3 


73.9% 



Above, we use regression splines with a fixed number of knots. Below, we use local lin- 
ear estimators with AICc selected parameters. The two additive estimators with three 
predictors are equivalent to each other but inferior to the rest. 
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Table 5 

Orthogonal decomposition of estimation on a grid into constant, additive, 
interaction and remainder components 



R 


h 


r 






r 3 


T"12 


^13 


r 2 3 


7*123 


1(T 4 
0.04 


0.240 
0.207 


3.95 
3.96 


0.482 
0.478 


0.054 
0.051 


0.025 
0.017 


0.023 
0.009 


0.082 
0.023 


0.019 
0.002 


0.059 
0.011 



We compare the mean squares of each component for the penalized (R — 0.04) 
and for the local linear (R = 10 -4 ) estimator. Penalizing shrinks interaction 
and remainder components. Univariate additive components are slightly re- 
duced. The indices are 1 = ibtp, 2 = day and 3 = hmdt. 



output point. Keep in mind that the smoothing windows are actually three- 
dimensional cubes and hence not all points inside the rectangle actually 
contribute to the local linear estimator. 

Let us mention that parameter selection criteria such as AICc and GCV 
evaluate the estimator at design points and hence are not influenced by its 




Fig. 5. Comparison of penalized and local linear estimators. Above, the univariate ad- 
ditive components are shown. Below, the bivariate components of ibtp and hmdt are com- 
pared. In between, the design is shown including the smoothing windows at (0.73,0.31). 
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behavior in sparse regions. Comparing the local linear estimator with the 
penalized estimator, we observe some strange structure for the former at 
hmdt = 0.3. This is clearly an artifact, as for day > 0.75 and ibtp > 0.75, the 
local linear estimator is an extrapolation. 

We conclude that the penalized estimator outperforms the additive esti- 
mator and is also superior to the full estimator regarding adjusted i?-squared 
and regularization properties. 

Reproducing simulation results. An implementation of f_ R together with 
the R code used in the simulations of this paper is provided at www.biostat.unizh.ch/ 
Software. 

APPENDIX 
A.0.1. Assumptions and details. 

A. 0.2. Conditions for optimality of the MLN estimator. MLN show that 
the estimator r add is asymptotically equal to the oracle estimator if (Yj, 
are i.i.d., the true regression function r truc (Xj) =E[Yi|Xj] is additive (1) 
and the following conditions hold: 

Condition MLN:B1. The kernel K is bounded, has compact support 
([— C\,Ci\), is symmetric about zero and is Lipschitz continuous. 

Condition MLN:B2'. The d-dimensional vector X has compact sup- 
port [0, l] d and its density / is bounded away from zero and infinity on 
[0A] d . 

The product kernel with bandwidths h\ , . . . , is constructed from 
the univariate kernel K by -fT/^X, x) = Ilfc=i-^([2L — x]fc/^fc)/^fe- 

Furthermore, the kernel is rescaled at the boundary such that for all 

Xie[o,i] d , 

/ ^ft(Xj,x)dx = l. 

J[0,l] d 

This modification does not affect the local linear estimator, but it changes 
its projection to the additive model. Hence, the estimation of the marginal 
design density is equal to an integrated full-dimensional density estimation. 
Additionally to MLN, we assume K > 0. 

Condition MLN:B3'. For some 9 > 5/2, E[\Y\ e ] < oo. 

Additionally to MLN, we assume E[e 4 ] < oo in Sections 4.5 and 4.6. 
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Condition MLN:B4'. The true regression function r true (x) = E [Y\ X = 
x] is twice continuously differentiable and / is once continuously differen- 
tiable. 

Condition CI. Assume there exist constants with n 1 / 5 /^ — ► Cf., k = 
l,...,d. 

Condition C1+. The bandwidths hi, . . . , are as in Condition CI or 
larger. As a matter of course we assume that — > 0. 

A. 0.3. Definition of Z. In Section 2.2 the output grid tj, j = 1, . . . ,m, 
and the parameters (3 = coL/(/3^) = coL/(col£(/3p^)) were introduced. In Sec- 
tion 3.2 P a( jd wa s decomposed into the product Z T Z. Instead of writing 
down the matrix Z, we show what Z does with a vector (3 £ Ff u u. For the in- 
dex ranges we use j = 1, . . . ,m; £ = 0, . . . , d; k = 1, . . . , d. Let (3 . = colj(P^). 
Define 

Z[3 = col(Z i/3 , Z 2 ^ , . . . , Z d ^ , Z Q i(3 1 ,Z 2{3 2 , Z d/3J, 

where Zofc/3 fc adds those which have the fcth coordinate of tj in com- 
mon, 

t\ =tj,k 



— « V 777, ^-r' 

For identifiability, all additive components of the intercept except the first 
one should have mean zero. Therefore, Zfc/3 Q is defined as Zo/c/3 with sub- 
tracted mean. 

We did not modify Z^ to have full rank, as this makes implementation 
more complicated, and it appears that the additional computing steps offset 
the gain due to lower dimension. Z T Z is a projection and hence ZZ T is too. 

A.0.4. Proofs. 

A. 0.5. Algorithm, structure and proofs for Section 3.2. 

Deriving (7). Rao and Kleffe (1988) provide a generalized inverse for 
the matrix B + CDC T , 

B - B CD(I + C T B CD) C T B~. 

This holds if B and D are symmetric and if the image of B contains the 
image of C. We apply this formula for (B, C, D) = (S + RI, Z T , —Rl). 
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Verify that Ps,r is a projection from Ff u u to F ac jd- Define 

(21) A StR := I - ZA R Z T = (I - ZZ T ) + Z(I - A fl )Z T . 

Because I — A R > 0, the image of As tR contains the image of Z(I — A R ). 
We get 

(22) As,flAi fJl Z(I-Afl) = Z(I-Afl). 
Furthermore, 

(23) (I-ZZ T )Z = 
and the definition of As t R implies 

(24) (I-ZZ T )A Sifi = (I-ZZ T ). 
Applying (22)-(24) leads to 

(25) (I — ZZ T )Ag R Z(I — Ar) = 0. 
Ps,/?, = Z T Ag fl Z(I — A^j) is a projection because 

Pl R = Z^^I-A^Ag^I-A*) 

( = ] Z T A SR (A S , R - (I - ZZ T ))A Sj/? Z(I - Aft) 

(25) 



Z A sr A s ,rA s h Z(I - A fl ; 



(22) 



Z'As ii? Z(I-A i? ) = P s , R 

and (I — A/j)Pgjj is symmetric. If As,/? is nonsingular, the image of Ps,_r 
is F add . Let us verify that Ps^Padd = Padd: 

(I - Ps, fi )Z T Z = Z T (I-A S 1 /? Z(I-A R )Z T )Z 

(2 = 1) Z T (I-As 1 i? (A Sifl -(I-ZZ T )))Z 

(2 = 3) Z T (I-A s 1 R As, fi )Z = 0. 
Formula (8) is straightforward. 

A. 0.6. Iterative formula and proofs for Section 3.3. 

Convergence of iterative algorithm in (9) in the case of nonuniqueness. 
We denote by (ZA R Z T )°° the projection to the subspace of eigenvectors 
of ZA R Z T with eigenvalues 1. Because Z(I — A R )f3 ,, is orthogonal to the 
above subspace, (ZA R Z T )°°j_^ = (ZA R Z T )°°y_W and I - (ZA R Z T )°°j_^ 
converges. 
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Iteration formula for large R, deriving (10). Starting with (8), multiply 
by Z, replace Ps,R0u by Z T Z/3^, subtract Z/3 , multiply by <xR, add Z/3 , 
and apply (I - ZZ T )Z@_ R = because of (23). 

A. 0.7. Properties on grid and proofs for Section 4.1. Proposition 1 is an 
interpretation of (8) and requires no proof. 

Proof of Proposition 2. We need to prove that P a dd3 R = f s,rP_ u , 
respectively 

P add ((I-A fl )(I-P S)jR )) = 0. 
Transposing and applying the symmetry of the matrix, this is equivalent to 

(I-A R )(I-P S)jR )P add = 0. 
This holds because Ps,i?P a dd — Padd [if and only if (5) is unique]. □ 

A. 0.8. Definition ofP*,R and proofs for Section 4.2. We use the abbrevi- 
ation 5 add = 7'add<S*7'add restricted to J-add- Hence, 5 add is a linear operator 
on .F add , an d it has a continuous inverse with probability tending to 1 for 
n — ► oo; see Lemma 2 in Section 4.4. 

Define the operator A*^ : .F add ^&dd'- 

^-*,R = 'Padd(S* + R3) ^"^add- 

If S~ dd exists and is continuous, A~^, is continuous because 

> — 7j<S a dd- 

II 4 - 5 * 1 1 2, sup + tt 

Let us define the projection V* t R : .Ffuii — > ^kdd by 
(26) V*,r = P add A;^ add (cS* + Ri^s* . 

5* is continuous (||«S*||2,sup < °°)> because kernel weights are bounded and 
have compact support (Condition MLN:B1). 

Below we will verify that the choice for f_ R in (13) in Section 4.2 satisfies 
the normal equation (12). We need the properties VnddT 7 *^ = V*,r and 

- (<?* + Ejy l R)v*, R ( = } A^A;^ add (S* + ej)- 1 s* 

(27) 

= £add(<->* + Rj) 5*. 
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We have to verify the normal equations (5* + R(J — V &dd )) Lr = $* L if 

(S* + R{3-V add ))r R 

= + KJ) - RV add )(S* + Rd)- 1 ^ + RV*, R }? U 

= S*? u + RT add {(I - (S* + EJ)- l R)V*, R - («S, + Rd)-^*}? u 
{ = ] r u + R{V add (S, + R3) ~ l S* - V add (<S* + R3) ~ l S*} r u = ? u . 
While ?u may be ambiguous, S*?u and (if exists) V*,rLii are unique. 

Proof of Proposition 3. In (14), V % r?u g In order to show 
that the other term is orthogonal to .F add , we prove that 

V add (S, + RJ)-^^ - V*,r) = 0. 

This holds because (S* + EJ)" 1 ^ = 3 - (S* + R3)- 1 R and (27). □ 

Proof of Proposition 4. The orthogonal projection from T to jF add 
is the same for || • ||* and for || • \r\ 

argmin || r — Haddll* = argmin \\r — r add ||# V r G J 7 , 

Hadd^-^add ladd^add 

because the penalty R\\(J — 'P a dd)'Po(r — H add ) 1 1 1 does not depend on r add . 
Consequently, we may exchange the two norms when projecting to J^dd) 
and we may simplify nested projections: 



~P* Lr = arg min 

ladd^^add 



Hadd - argmm || r - r Y \\R 

iG^full 



- add 



£-7"add 



argmm £ add — argmin || r — r Y \\r 



re^fuii 



R 



= argmm || r add - r Y \\ R = argmin || r add - r Y ||* = r add . 

ladd^add ladd^add U 

A. 0.9. Proofs for Section 4.3. 

Proof of Proposition 5. The proof follows from Gao (2003). In 
particular, continuity on R d of the design density / does not hold. How- 
ever, Condition MLN:B2' states that / is bounded on [0, l] d . For an upper 
bound of Soo(*0, we choose some smooth density / and a constant c with 
/(x) < c /(x) and add |_( c — l) n J virtual observations with distribution 
(c/ — /)/(c— 1). The density estimator based on all [en J observations is 
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bounded in probability and So,o/ c is smaller. The boundary adjustments 
are handled analogously. □ 



Proof of Lemma 1. Because of Lemma 2 and Proposition 5, 5* and 
'-'add are bounded. In Appendix A. 0.8, the claim is already shown for small R 
and (26). Using (17) in Section 4.4, the proof is straightforward for large R. 
□ 



A. 0.10. Continuity in R = oo and proofs for Section 4.4. Continuity in 
R = requires no proof. 



Proof of Lemma 2. 



Overview. MLN showed in Theorem 1' that the estimator r add is unique 
with probability tending to 1 for n — > oo. Uniqueness is equivalent to || r ||* > 
for all r £ F&dd with \\r ||2 = 1. Using their technique of proof, we may 
even show that the above norm is bounded away from zero with probability 
tending to 1. In this case 5 add has a continuous inverse with respect to || • ||2- 



Sketch of proof for Theorem MLN:l'. The normal equations for r add are 

5 a dd£add = ^addZ:L- 

Define the matrix Mfc(x) which depends only on the one-dimensional data 
(Yi,X itk ), i = l,...,n: 



1 n r 

M fc (x) = -^ / K h (X i) x)dx_ fc y i 

Tb ■ -.J 
1=1 



1 



A^i,fc -Eft 



hh. 



i,k -£fe 

hk 



X 



i,k 



X k \ 



hk 



■ 



With probability tending to 1, M fc 1 (x) is continuous, and in this case some 
f is obtained by a continuous mapping of VnddLi'- 



-add 



^"^add + ^; 



where T is some contraction. Therefore the solution is unique and VgddLL l— * 
r add is continuous. Both T and r depend on M^ 1 and the two-dimensional 
empirical marginal distribution of Xj- 

Because "PaddTLi even when choosing arbitrary values for Y, does not 
occupy J^dd) we cannot (yet) conclude that S~^ d exists and is continuous. 
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Definition ofV*. Let us now examine the orthogonal (with respect to 
II • II*) projection V* from JFf u n to Tadd, 

P*L = argmin ||r - r add ||*. 
The normal equations are 

<5add ZLadd = ^add<S* T . 

By choosing r appropriately, one can prove that V ax id'S*L ► £ a dd is contin- 
uous. Because of the uniqueness of V* , the image of .Ffuu under the mapping 
'Padd'S* is equal to ^ a dd and therefore S~ d \ exists and is continuous. 

Convergence o/5 a dd- The operator 5 add depends on bivariate terms only. 
Under Conditions MLN:B1, MLN:B2' and CI, these terms converge to their 
theoretical counterparts, which depend on the design density /. Further- 
more, MLN argued that T is a contraction (with probability tending to 1) 
because it converges to T, which is a contraction [MLN: (69)]. 

Bandwidths larger than n" 1 / 5 . The above calculation assumes that h is 
proportional to n -1 / 5 . In the proof of MLN, one piece was the convergence 
of T to T , which depends only on the theoretical design density /. In case of 
oversmoothing, variability is reduced and the expected part is not critical, 
as Condition MLN:B2' holds also for smoothed /. Hence, Condition CI may 
be replaced by Condition C1+. □ 

Proof of Lemma 3. HPaddEzlla is essentially univariate and therefore 
Op(1). 

Define 7\dd+ ^ifuii — ► -^add via (P a dd+Zl) (x) = Efc=i / »" (x) dx -k and 
(^ > add+Zl) fc (x) = / r fe (x) <ix By construction, 7 3 a dd+ is monotone: if r^(x) > 
r € U) (ye, x), then (P a dd+r) £ (x) > (Padd+r)^(x). Note that P a dd is not 
monotone. Denote by Vr the contraction 3 — (3 + R~ 1 S*)~ 1 which is a point- 
wise linear transformation with ||^>_r || 2,sup < R~ 1 ||«S*|| 2,sup- 

We want to prove that "Padd+^RZLL is arbitrarily small when R~ 1 S* is 
small enough. Let us sketch the proof in a simplified case. If T>r were diago- 
nal, we would use the monotonicity of "P a dd+ to obtain an upper bound by re- 
placing r£(x) by its absolute value (V£, x) and enlarging Vr to ||£>.R||2,sup3: 

(28) ||(Padd+^£i)(x)||2 < ||^||2,sup||(^add+Z:|L|)(x)|| 2 , 

where r_\n denotes r L with absolute values. The pointwise upper bound for 
IK'Padd+lLXxJIk remains valid if Yi and X^j. — Xk are replaced by \Yi\ and 
\Xi t k — Xk\, respectively. 

In practice, positivity of all components is generally not preserved under 
multiplication by Dr. By Condition MLN:B1, the kernel K has compact 
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support [— Ci,C\] and therefore the slope terms of r_L are bounded by the 
intercept, r^(x) < Cir£(x). The norm ||(I ) Hr £/ )(x)||2 is bounded pointwise 



in x by Jl + C?d\\V R h^Yl,iKh(Xi,xM\- 



Let r| L |(x) = ^l + Cfd^Ei=iK h (Xi,K)\Yi\(l,...,l) and (28) holds. 
Again, V a dd+ L\l\(^) depends on univariate terms only, which is essential 
in high dimensions, where ^ Ya=i Kh{^Li, x) is not of constant order. 

We conclude that sup* UPadd+^RL^i^h = Op( J R~ 1 ||5 Jf || 2 ,sup). □ 

Proof of Lemma 4. Let a = r y — Zr, b = f_ R — r add and c = a + 
b = r_Y ~ £add- Because r R minimizes \\r — £y||jj and ||£ add ~~ Hy||ii is 
independent of R, 

( 29 ) \H\r < \\c\\r = ||c||*. 
Obtain a bound for ||(J — P a dd)L R \\2 using (4): 

R\\0 -V^LrWI ( = ||«||* < ||c||* - = (c-a,c + a)* 

= (6, 2c -6)* <2<6,c}* <2||&||*||c||*. 

Then 

IIP - P^LrWI < 2E _1 (V||5*|| 2 , S up|| LR ~ £addl|2)|| L Y ~ £addll*- 

Because 5o,o(x) is a density estimate, ||<S*||2,sup > 1 and we omit the square 
root. 

Because r add is a minimizer, ||r Y - £ add ||* < ||r y ||^ = \Ya=i y ? ■ For 
increasing n, this is Op(l) because of Condition MLN:B3'. □ 

Proof of Theorem 2. As the expected part is additive [up to op(h 2 ) 
terms], the nonadditive part consists only of the variance terms: ||(J — 
P*,r)lu\\1 = Op(^j-z). By Proposition 4, r^ d -P e d d r R = P*(3-P add )r R , 
where P* is continuous (with probability — ► 1). By Proposition 3, (J — 
Padd)LR = (<S* + R3)~ 1 S*('J — P*,r)zu and the claim follows from ||(<S* + 
R3)~ 1 S* ||2,sup < | 1 1 2,sup and uniform continuity of P* tR (Lemma 1). 
□ 

A.0.11. Model selection by AIC {Section 4.5). 

Proof of Lemma 5. We use a formula from Rao and Kleffe [(1988), 
pages 31ff], 

(30) cov(e T B£,£ T Ce) = 2cr 4 tr(BC) + K<r 4 tr(Bdiag(C)), 
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where B, C are symmetric matrices of dimension n and E[ef] = (3 + k)o . 
Using B = C = ^(Mj + M R ) we obtain 

var((£,M i? ,e)) = a 4 (tr(M fl Mji) +tr(MjM fl ) + «tr ( diag(M if ) 2 )). 

Note that HM^y 2 ^ = tr(M^M/j) is known as a Hilbert-Schmidt norm 
and tr(M fl M K ) = (M^,M R ) HS is bounded by ||Mfl|| HS ||Mj|| hs- Hence 
vw({e,M R e)) < (2 + |k|)ct 4 tr(M^M R ) and E[||M fl e|| 2 ] =<T 2 tr(MjM fl ). 

Analogously, using tr^M^M/j) 2 ) < ||M^Mft|| sup tr(M^M^), the vari- 
ance of ||M R e|| 2 is smaller than (2 + \n\)a 2 \\M R \M R \\ snp E[\\M R e || 2 ]. □ 

A. 0.12. Proofs for Section 4.6. 

Proof of Proposition 7. It is well known that E[||M«£|| 2 ] oc -4^ and 
E[||M ac jd£ || 2 ] oc Let us start with the A 2 terms: Because ^3-||M ac jd£|| 2 
is of smaller order than — s || 2 , the mixed term -^(M^ e , M a dd£) is 
bounded by the Cauchy-Schwarz inequality. Furthermore, (1 — E)[^j||M^ e || 2 ] 
is negligible compared to E[^j||M^ e || 2 ] (see Lemma 5), indicating that the 
inverse of the A 2 terms is Op{nh d ). 

For the A-linear terms, it is obvious that -^(l — E)[(e, M^e)] is the 
largest stochastic term. It remains to show that - 2 -rE[(M^e,M a dd£)] is 
nonnegative, as a nonnegative coefficient of A indicates that the minimum is 
at A m j n < (R = oo). As we are using a fixed uniform design, the local lin- 
ear and the Nadaraya- Watson estimator coincide (ignoring the boundary). 
Hence, we are interested in the covariance of a multivariate and a univari- 
ate Nadaraya- Watson estimator with nonnegative kernel weights, whose hat 
matrices have therefore nonnegative elements. □ 

Acknowledgments. The authors would like to thank the editors and ref- 
erees for their helpful comments that led to a considerable improvement of 
this paper. 

REFERENCES 

Fan, J. (1993). Local linear regression smoothers and their minimax efficiencies. Ann. 

Statist. 21 196-216. MR1212173 
Fan, J., Gasser, T., Gijbels, I., Brockmann, M. and Engel, J. (1997). Local polynomial 

regression: Optimal kernels and asymptotic minimax efficiency. Ann. Inst. Statist. Math. 

49 79-99. MR1450693 

Gao, F. (2003). Moderate deviations and large deviations for kernel density estimators. 
J. Theoret. Probab. 16 401-418. MR1982035 

Hardle, W., Hall, P. and Marron, J. S. (1988). How far are automatically chosen re- 
gression smoothing parameters from their optimum? (with discussion). J. Amer. Statist. 
Assoc. 83 86-101. MR941001 



36 



M. STUDER, B. SEIFERT AND T. GASSER 



Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, 
London. MR1082147 

HuRVlCH, C, Simonoff, J. and Tsai, C. (1998). Smoothing parameter selection in non- 
parametric regression using an improved Akaike information criterion. J. R. Stat. Soc. 
Ser. B Stat. Methodol. 60 271-293. MR1616041 

Mammen, E., Linton, O. and Nielsen, J. (1999). The existence and asymptotic properties 
of a backfitting projection algorithm under weak conditions. Ann. Statist. 27 1443-1490. 
MR1742496 

Mammen, E., Marron, J. S., Turlach, B. and Wand, M. (2001). A general projection 

framework for constrained smoothing. Statist. Sci. 16 232-248. MR1874153 
Nielsen, J. and Linton, O. (1998). An optimization interpretation of integration and 

back-fitting estimators for separable nonparametric models. J. R. Stat. Soc. Ser. B 

Stat. Methodol. 60 217-222. MR1625624 
Nielsen, J. and Sperlich, S. (2005). Smooth backfitting in practice. J. R. Stat. Soc. Ser. 

B Stat. Methodol. 67 43-61. MR2136638 
RAO, C. R. and Kleffe, J. (1988). Estimation of Variance Components and Applications. 

North-Holland, Amsterdam. MR933559 
Seifert, B. and Gasser, T. (1996). Finite-sample variance of local polynomials: Analysis 

and solutions. J. Amer. Statist. Assoc. 91 267-275. MR1394081 
Seifert, B. and Gasser, T. (2000). Data adaptive ridging in local polynomial regression. 

J. Comput. Graph. Statist. 9 338-360. MR1822090 
Silverman, B. (1978). Weak and strong uniform consistency of the kernel estimate of a 

density and its derivatives. Ann. Statist. 6 177-184. MR471166 
Stone, C. (1980). Optimal rates of convergence for nonparametric estimators. Ann. 

Statist. 8 1348-1360. MR594650 
Stone, C. (1982). Optimal global rates of convergence for nonparametric regression. Ann. 

Statist. 10 1040-1053. MR673642 
Stone, C. (1985). Additive regression and other nonparametric models. Ann. Statist. 13 

689-705. MR790566 

Stone, C. (1986). The dimensionality reduction principle for generalized additive models. 

Ann. Statist. 14 590-606. MR840516 
Studer, M. (2002). Nonparametric regression penalizing deviations from additivity. Ph.D. 

dissertation 14696, Swiss Federal Institute of Technology, Zurich (ETHZ). 

Department of Biostatistics. ISPM 
University of Zurich 
Sumatrastrasse 30 
CH-8006 Zurich 
Switzerland 

E-MAIL: mstuder@ifspm.unizh.ch 



